Hardening Fingerprinting by Context

Authors

  • Aleksander Kolcz
  • Abdur Chowdhury
Abstract

Near-duplicate detection is not only an important pre- and post-processing task in Information Retrieval but also an effective spam-detection technique. Among the different approaches to near-replica detection, methods based on document signatures are particularly attractive because they scale to massive document collections and can handle high throughput rates. Their weakness lies in the potential brittleness of signatures to small changes in content, which makes them vulnerable to various types of noise. In the important spam-filtering application, this vulnerability can also be exploited by dedicated attackers aiming to maximally fragment the signatures corresponding to the same email campaign. We focus on the I-Match algorithm and present a method of strengthening it by considering the usage context when deciding which portions of a document should affect signature generation. This substantially (almost 100-fold in some cases) increases the difficulty of dedicated attacks and provides effective protection against document noise in non-adversarial settings. Our analysis is supported by experiments using a real email collection.
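
As a rough illustration of the signature-based approach described above, the sketch below follows the commonly described form of I-Match: keep only the document terms that appear in a fixed lexicon, sort the unique survivors, and hash the result. The lexicon contents, the tokenizer, and the name `imatch_signature` are assumptions made for illustration. A real lexicon would be derived from collection statistics, and the context-based hardening proposed in the paper is not shown because the abstract does not specify it.

```python
import hashlib
import re


def imatch_signature(text, lexicon):
    """Return a SHA-1 hex digest over the sorted unique lexicon terms in text."""
    # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Terms outside the lexicon never influence the signature.
    kept = sorted(set(tokens) & lexicon)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


# Hypothetical lexicon, hard-coded here only to keep the example self-contained.
LEXICON = frozenset({"cheap", "pills", "online", "pharmacy", "discount"})

original = "Cheap pills from our online pharmacy!"
# Reordering and an out-of-lexicon noise token leave the signature unchanged.
reordered = "Online pharmacy: pills, cheap!!! xqzt"
# Appending a single in-lexicon word changes the signature (the brittleness).
attacked = "Cheap pills from our online pharmacy! discount"

sig_original = imatch_signature(original, LEXICON)
sig_reordered = imatch_signature(reordered, LEXICON)
sig_attacked = imatch_signature(attacked, LEXICON)

print(sig_original == sig_reordered)
print(sig_original == sig_attacked)
```

The second comparison shows the weakness the abstract points to: one well-chosen inserted word is enough to split copies of the same message into different signatures.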


Similar resources

DNA Fingerprinting Based on Repetitive Sequences of Iranian Indigenous Lactobacilli Species by (GTG)5- REP-PCR

Background and Objective: The use of lactobacilli as probiotics requires the application of accurate and reliable methods for the detection and identification of bacteria at the strain level. Repetitive sequence-based polymerase chain reaction (rep-PCR), a DNA fingerprinting technique, has been successfully used as a powerful molecular typing method to determine taxonomic and phylogenetic relat...


Geodabs: Trajectory Indexing Meets Fingerprinting at Scale

Finding trajectories and discovering motifs that are similar in large datasets is a central problem for a wide range of applications. Solutions addressing this problem usually rely on spatial indexing and on the computation of a similarity measure in polynomial time. Although effective in the context of sparse trajectory datasets, this approach is too expensive in the context of dense datasets,...


On the optimum die angle in rod drawing process considering strain-hardening effect of material

In this paper, rod drawing process of strain-hardening materials is investigated by analytical, numerical and experimental methods. The classic upper bound solution, based on the assumption of perfect plasticity, has been extended to consider the work-hardening of the material during the drawing process. For a given process conditions and mechanical properties of the rod material, the power ter...


A Crystal Plasticity View of Powder Compaction

The cold compaction of an aggregate of powder is treated from the viewpoint of crystal plasticity theory. The contacts between particles are treated as compaction planes which yield under both normal and shear straining. The hardening of each plane represents both geometric and material hardening at the contacts between particles; the macroscopic tangent stiffness can be written in terms of the...


Two-Level Fingerprinting Codes: Non-Trivial Constructions

We extend the concept of two-level fingerprinting codes, introduced by Anthapadmanabhan and Barg (2009) in the context of traceability (TA) codes [1], to other types of fingerprinting codes, namely identifiable parent property (IPP) codes, secure-frameproof (SFP) codes, and frameproof (FP) codes. We define and propose the first explicit non-trivial constructions for two-level IPP, SFP, and FP codes.




Publication date: 2007